{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#用管线命令连接多个转换方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面，让我们用管线命令连接多个转换方法，来演示一个复杂点儿的例子。\n",
    "\n",
    "<!-- TEASER_END -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Getting ready"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本主题将再度释放管线命令的光芒。之前我们用它处理缺失数据，只是牛刀小试罢了。下面我们用管线命令把多个预处理步骤连接起来处理，会非常方便。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，我们加载带缺失值的`iris`数据集："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ nan,  3.5,  1.4,  0.2],\n",
       "       [ 4.9,  3. ,  1.4,  0.2],\n",
       "       [ 4.7,  3.2,  1.3,  0.2],\n",
       "       [ 4.6,  nan,  1.5,  nan],\n",
       "       [ 5. ,  3.6,  1.4,  0.2]])"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "import numpy as np\n",
    "\n",
    "iris = load_iris()\n",
    "iris_data = iris.data\n",
    "\n",
    "mask = np.random.binomial(1, .25, iris_data.shape).astype(bool)\n",
    "iris_data[mask] = np.nan\n",
    "\n",
    "iris_data[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How to do it..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本主题的目标是首先补全`iris_data`的缺失值，然后对补全的数据集用PCA。可以看出这个流程需要一个训练数据集和一个对照集（holdout set）；管线命令会让事情更简单，不过之前我们做一些准备工作。\n",
    "\n",
    "首先加载需要的模块："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import pipeline, preprocessing, decomposition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后，建立`Imputer`和`PCA`类："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pca = decomposition.PCA()\n",
    "imputer = preprocessing.Imputer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有了两个类之后，我们就可以用管线命令处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-2.44, -0.79, -0.12, -0.1 ],\n",
       "       [-2.67,  0.2 , -0.21,  0.15],\n",
       "       [-2.83,  0.31, -0.19, -0.08],\n",
       "       [-2.35,  0.66,  0.67, -0.06],\n",
       "       [-2.68, -0.06, -0.2 , -0.4 ]])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe = pipeline.Pipeline([('imputer', imputer), ('pca', pca)])\n",
    "np.set_printoptions(2)\n",
    "iris_data_transformed = pipe.fit_transform(iris_data)\n",
    "iris_data_transformed[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果我们用单独的步骤分别处理，每个步骤都要用一次`fit_transform`，而这里只需要用一次，而且只需要一个对象。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How it works..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "管线命令的每个步骤都是用一个元组表示，元组的第一个元素是对象的名称，第二个元素是对象。\n",
    "\n",
    "本质上，这些步骤都是在管线命令调用时依次执行`fit_transform`方法。还有一种快速但不太简洁的管线命令建立方法，就像我们快速建立标准化调整模型一样，只不过用`StandardScaler`会获得更多功能。`pipeline`函数将自动创建管线命令的名称："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('imputer',\n",
       "  Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)),\n",
       " ('pca', PCA(copy=True, n_components=None, whiten=False))]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe2 = pipeline.make_pipeline(imputer, pca)\n",
    "pipe2.steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这和前面的模型结果一样："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-2.44, -0.79, -0.12, -0.1 ],\n",
       "       [-2.67,  0.2 , -0.21,  0.15],\n",
       "       [-2.83,  0.31, -0.19, -0.08],\n",
       "       [-2.35,  0.66,  0.67, -0.06],\n",
       "       [-2.68, -0.06, -0.2 , -0.4 ]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_data_transformed2 = pipe2.fit_transform(iris_data)\n",
    "iris_data_transformed2[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##There's more..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "管线命令连接内部每个对象的属性是通过`set_params`方法实现，其参数用`<对象名称>__<对象参数>`表示。例如，我们设置PCA的主成份数量为2："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('pca', PCA(copy=True, n_components=2, whiten=False))])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe2.set_params(pca__n_components=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">`__`标识在Python社区读作**dunder**。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里`n_components=2`是`pca`本身的参数。我们再演示一下，输出将是一个$N \\times 2$维矩阵："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-2.44, -0.79],\n",
       "       [-2.67,  0.2 ],\n",
       "       [-2.83,  0.31],\n",
       "       [-2.35,  0.66],\n",
       "       [-2.68, -0.06]])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris_data_transformed3 = pipe2.fit_transform(iris_data)\n",
    "iris_data_transformed3[:5]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
